7 Bayesian Treatment for Regularization

1 Bayesian Regularization

Recall the Bayesian regression model: $y = X\beta + \varepsilon$, $\varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0,\sigma^2)$, $\beta_j \overset{\text{i.i.d.}}{\sim} \mathrm{Unif}(-C, C)$ for a large $C$. We showed earlier that
$$\beta \mid \text{data}, \sigma \sim N\big((X^TX)^{-1}X^Ty,\; \sigma^2 (X^TX)^{-1}\big) \quad \text{as } C \to \infty. \tag{4.1}$$

A different prior is $\beta_j \overset{\text{i.i.d.}}{\sim} N(0, C)$; then
$$\beta \mid \text{data}, \sigma \sim N\bigg(\Big(\frac{X^TX}{\sigma^2} + \frac{I}{C}\Big)^{-1}\frac{X^Ty}{\sigma^2},\; \Big(\frac{X^TX}{\sigma^2} + \frac{I}{C}\Big)^{-1}\bigg). \tag{4.2}$$
As $C \to \infty$, (4.2) reduces to (4.1).
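The limit can be checked numerically. The sketch below (hypothetical data, numpy) computes the posterior mean of (4.2) and confirms it approaches the least-squares estimate as $C$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
sigma = 0.3
y = X @ beta_true + sigma * rng.normal(size=n)

def posterior_mean(C):
    # Mean of (4.2): (X^T X / sigma^2 + I/C)^{-1} X^T y / sigma^2
    A = X.T @ X / sigma**2 + np.eye(p) / C
    return np.linalg.solve(A, X.T @ y / sigma**2)

# Unregularized least-squares estimate, the C -> infinity limit in (4.1)
ols = np.linalg.solve(X.T @ X, X.T @ y)

# The gap to OLS shrinks as the prior variance C grows
for C in [0.1, 10.0, 1e6]:
    print(C, np.linalg.norm(posterior_mean(C) - ols))
```

For small $C$ the prior shrinks the estimate noticeably toward zero; by $C = 10^6$ the two estimates agree to several decimal places.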

Now consider the prior $\beta_0, \beta_1 \overset{\text{i.i.d.}}{\sim} N(0, C)$ and $\beta_2, \dots, \beta_{n-1} \overset{\text{i.i.d.}}{\sim} N(0, \tau^2)$ for a small $\tau$, i.e. $\beta \sim N(0, Q)$ with $Q = \operatorname{diag}(C, C, \tau^2, \dots, \tau^2)$. Then
$$\beta \mid \text{data}, \sigma \sim N\bigg(\Big(\frac{X^TX}{\sigma^2} + Q^{-1}\Big)^{-1}\frac{X^Ty}{\sigma^2},\; \Big(\frac{X^TX}{\sigma^2} + Q^{-1}\Big)^{-1}\bigg). \tag{4.3}$$

(4.2) and (4.3) will be proved later.

The posterior mean is then
$$\Big(\frac{X^TX}{\sigma^2} + Q^{-1}\Big)^{-1}\frac{X^Ty}{\sigma^2} = (X^TX + \sigma^2 Q^{-1})^{-1} X^Ty. \tag{4.4}$$
(Note this is closely related to (2.1).) Here $Q^{-1} = \operatorname{diag}(C^{-1}, C^{-1}, \tau^{-2}, \dots, \tau^{-2})$. When $C$ is large, $Q^{-1} \approx \frac{1}{\tau^2} J$ with $J = \operatorname{diag}(0, 0, 1, \dots, 1)$, so (4.4) becomes $(X^TX + \frac{\sigma^2}{\tau^2} J)^{-1} X^Ty$, which matches (2.1) if $\lambda = \frac{\sigma^2}{\tau^2}$.
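The ridge correspondence $\lambda = \sigma^2/\tau^2$ can be verified directly. This sketch (hypothetical data and parameter values) compares the Bayesian posterior mean (4.4) with the ridge-style estimate for a large $C$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.2 * rng.normal(size=n)
sigma, tau, C = 0.2, 0.5, 1e8

# Q = diag(C, C, tau^2, ..., tau^2); J zeroes out the two near-flat-prior coordinates
Qinv = np.diag([1 / C, 1 / C] + [1 / tau**2] * (p - 2))
J = np.diag([0.0, 0.0] + [1.0] * (p - 2))

# Posterior mean (4.4): (X^T X + sigma^2 Q^{-1})^{-1} X^T y
mean_bayes = np.linalg.solve(X.T @ X + sigma**2 * Qinv, X.T @ y)

# Ridge-style estimate with lambda = sigma^2 / tau^2
lam = sigma**2 / tau**2
mean_ridge = np.linalg.solve(X.T @ X + lam * J, X.T @ y)

print(np.max(np.abs(mean_bayes - mean_ridge)))  # tiny for large C
```

The only difference between the two precision matrices is the $C^{-1}$ entries, which vanish as $C \to \infty$.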

2 Bayesian Approach for Dealing with Unknown τ and σ

Assume $\log\tau, \log\sigma \overset{\text{i.i.d.}}{\sim} \mathrm{Unif}(-C, C)$ and $\beta \mid \tau, \sigma \sim N(0, Q)$, with $Q$ as above. The joint prior density is
$$f_{\beta,\tau,\sigma}(\beta,\tau,\sigma) = f_\tau(\tau)\, f_\sigma(\sigma)\, f_{\beta\mid\tau}(\beta) = \frac{\mathbb{1}\{e^{-C} < \tau, \sigma < e^{C}\}}{4C^2 \tau\sigma}\Big(\frac{1}{\sqrt{2\pi}}\Big)^{n}\frac{1}{\sqrt{\det Q}}\exp\Big(-\frac{1}{2}\beta^T Q^{-1}\beta\Big) \propto \frac{1}{\tau\sigma}\frac{1}{\sqrt{\det Q}}\exp\Big(-\frac{1}{2}\beta^T Q^{-1}\beta\Big).$$
(The indicator has been dropped because $C$ is large.) The likelihood is $\big(\frac{1}{\sqrt{2\pi}}\big)^{n}\sigma^{-n}\exp\big(-\frac{1}{2\sigma^2}\|y - X\beta\|^2\big)$, so the posterior satisfies
$$f_{\beta,\tau,\sigma\mid\text{data}}(\beta,\tau,\sigma) \propto \sigma^{-n-1}\tau^{-1}\frac{1}{\sqrt{\det Q}}\exp\bigg(-\frac{1}{2}\Big(\frac{1}{\sigma^2}\|y - X\beta\|^2 + \beta^T Q^{-1}\beta\Big)\bigg).$$
Complete the square in the exponent:
$$\frac{1}{\sigma^2}\|y-X\beta\|^2 + \beta^TQ^{-1}\beta = \frac{y^Ty}{\sigma^2} - \frac{2\beta^TX^Ty}{\sigma^2} + \beta^T\Big(\frac{X^TX}{\sigma^2}+Q^{-1}\Big)\beta$$
$$= (\beta-\mu)^T\Big(\frac{X^TX}{\sigma^2}+Q^{-1}\Big)(\beta-\mu) + \frac{y^Ty}{\sigma^2} - \mu^T\Big(\frac{X^TX}{\sigma^2}+Q^{-1}\Big)\mu$$
$$= (\beta-\mu)^T\Big(\frac{X^TX}{\sigma^2}+Q^{-1}\Big)(\beta-\mu) + \frac{y^Ty}{\sigma^2} - \frac{y^TX}{\sigma^2}\Big(\frac{X^TX}{\sigma^2}+Q^{-1}\Big)^{-1}\frac{X^Ty}{\sigma^2},$$
where $\mu = \big(\frac{X^TX}{\sigma^2}+Q^{-1}\big)^{-1}\frac{X^Ty}{\sigma^2}$. Plugging into the posterior:
$$f_{\beta,\tau,\sigma\mid\text{data}}(\beta,\tau,\sigma) \propto \sigma^{-n-1}\tau^{-1}\frac{1}{\sqrt{\det Q}}\exp\bigg(-\frac{1}{2}(\beta-\mu)^T\Big(\frac{X^TX}{\sigma^2}+Q^{-1}\Big)(\beta-\mu)\bigg)\exp\Big(-\frac{y^Ty}{2\sigma^2}\Big)\exp\bigg(\frac{y^TX}{2\sigma^2}\Big(\frac{X^TX}{\sigma^2}+Q^{-1}\Big)^{-1}\frac{X^Ty}{\sigma^2}\bigg).$$
The dependence on $\beta$ enters only through the quadratic form, which implies $\beta \mid \text{data}, \sigma, \tau \sim N\big(\mu, \big(\frac{X^TX}{\sigma^2}+Q^{-1}\big)^{-1}\big)$. This proves (4.2) and (4.3). Integrating $\beta$ out of the posterior also gives
$$f_{\tau,\sigma\mid\text{data}}(\tau,\sigma) \propto \sigma^{-n-1}\tau^{-1}\frac{1}{\sqrt{\det Q}}\sqrt{\det\Big(\frac{X^TX}{\sigma^2}+Q^{-1}\Big)^{-1}}\exp\Big(-\frac{y^Ty}{2\sigma^2}\Big)\exp\bigg(\frac{y^TX}{2\sigma^2}\Big(\frac{X^TX}{\sigma^2}+Q^{-1}\Big)^{-1}\frac{X^Ty}{\sigma^2}\bigg).$$
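The completing-the-square identity is easy to check numerically. The sketch below (arbitrary made-up data and parameter values) evaluates both sides of the identity at a random $\beta$ and confirms they agree:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
beta = rng.normal(size=p)
sigma, tau, C = 0.7, 0.4, 100.0
Qinv = np.diag([1 / C, 1 / C] + [1 / tau**2] * (p - 2))

A = X.T @ X / sigma**2 + Qinv      # precision of the conditional posterior
b = X.T @ y / sigma**2
mu = np.linalg.solve(A, b)         # posterior mean

# Left side: ||y - X beta||^2 / sigma^2 + beta^T Q^{-1} beta
lhs = np.sum((y - X @ beta)**2) / sigma**2 + beta @ Qinv @ beta

# Right side: quadratic around mu plus the beta-free remainder
rhs = (beta - mu) @ A @ (beta - mu) + y @ y / sigma**2 - b @ np.linalg.solve(A, b)

print(abs(lhs - rhs))  # ~0: the two forms agree
```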

3 Comments on Bayesian Regularization

In practice, $f_{\tau,\sigma\mid\text{data}}$ tends to prefer $\tau$ neither too small nor too large, because $f_{\tau,\sigma\mid\text{data}}(\tau,\sigma) \propto f_{\text{data}\mid\tau,\sigma}(\text{data})\, f_{\tau,\sigma}(\tau,\sigma)$ and $f_{\tau,\sigma}(\tau,\sigma)$ is quite flat. (Note that there is a big difference between $f_{\text{data}\mid\beta,\sigma}(\text{data})$ and $f_{\text{data}\mid\tau,\sigma}(\text{data})$.)
Maximizing $f_{\text{data}\mid\beta,\sigma}(\text{data})$ over $\beta$ yields the unregularized least-squares estimate, which overfits. On the other hand, maximizing $f_{\text{data}\mid\tau,\sigma}(\text{data})$ leads to a fairly small estimate $\hat\tau$, and hence a smooth trend function. This can be understood by noting that $f_{\text{data}\mid\tau,\sigma}(\text{data}) = \int f_{\text{data}\mid\beta,\sigma}(\text{data})\, f_{\beta\mid\tau}(\beta)\, d\beta$. When $\tau$ is large, $f_{\beta\mid\tau}(\beta)$ is small for every $\beta$, simply because the normal density with variance $\tau^2$ is flat for large $\tau$. When $\tau$ is too small, $f_{\beta\mid\tau}(\beta)$ is significant only for very smooth $\beta$, but those $\beta$ give poor values of $f_{\text{data}\mid\beta,\sigma}(\text{data})$.
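This trade-off can be seen numerically. For the simpler prior $\beta \sim N(0, \tau^2 I)$, the marginal likelihood integrates to $y \mid \tau, \sigma \sim N(0,\; \sigma^2 I + \tau^2 XX^T)$. The sketch below (hypothetical data simulated with $\tau_{\text{true}} = 1$) evaluates the log marginal likelihood on a grid of $\tau$ and shows that both extremes score poorly:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 5
X = rng.normal(size=(n, p))
sigma, tau_true = 0.5, 1.0
beta = tau_true * rng.normal(size=p)
y = X @ beta + sigma * rng.normal(size=n)

def log_evidence(tau):
    # y | tau, sigma ~ N(0, sigma^2 I + tau^2 X X^T): beta integrated out
    Sigma = sigma**2 * np.eye(n) + tau**2 * X @ X.T
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(Sigma, y))

# Tiny tau cannot fit the data; huge tau pays a large log-determinant penalty
for tau in [0.01, 0.1, 1.0, 10.0, 100.0]:
    print(tau, round(log_evidence(tau), 1))
```

Small $\tau$ is penalized through the quadratic term (the prior forces $\beta \approx 0$, so the data look implausible), while large $\tau$ is penalized through $\log\det\Sigma$, matching the argument above.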


Model: $y = \beta_0 + \beta_1 x$. But how do we estimate $\beta_0, \beta_1$?
Well, given a guess, we can at least measure how "bad" it is.

Denote our footprint lengths as $x_1,\dots,x_n$, and heights as $y_1,\dots,y_n$.
If $\beta_0,\beta_1$ were known, we would predict the heights as $\hat y_1,\dots,\hat y_n$, with $\hat y_i = \beta_0 + \beta_1 x_i$.

Define the loss function: $L(\beta_0,\beta_1) = \sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n \big[y_i - (\beta_0 + \beta_1 x_i)\big]^2$.

Our ultimate goal is to minimize the loss function, i.e. to solve
$$\begin{cases}\dfrac{\partial L}{\partial \beta_0}(\beta_0,\beta_1) = 0,\\[4pt] \dfrac{\partial L}{\partial \beta_1}(\beta_0,\beta_1) = 0.\end{cases} \tag{1}$$

Denote $\bar x = \frac{1}{n}\sum_{i=1}^n x_i$, $\bar y = \frac{1}{n}\sum_{i=1}^n y_i$, $\overline{xy} = \frac{1}{n}\sum_{i=1}^n x_i y_i$, $\overline{x^2} = \frac{1}{n}\sum_{i=1}^n x_i^2$. Then the solution is
$$\beta_1 = \frac{\overline{xy} - \bar x\,\bar y}{\overline{x^2} - (\bar x)^2}, \qquad \beta_0 = \bar y - \bar x\,\beta_1. \tag{2}$$
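The closed form (2) takes a few lines to implement. The sketch below uses made-up footprint/height numbers (the actual data are not given in the notes) and cross-checks the result against numpy's built-in least-squares fit:

```python
import numpy as np

# Hypothetical data: footprint lengths x (cm) and heights y (cm)
x = np.array([24.0, 25.5, 26.0, 27.2, 28.1, 29.0])
y = np.array([165.0, 170.0, 172.0, 176.0, 180.0, 184.0])

xbar, ybar = x.mean(), y.mean()
xybar, x2bar = (x * y).mean(), (x**2).mean()

beta1 = (xybar - xbar * ybar) / (x2bar - xbar**2)  # slope from (2)
beta0 = ybar - xbar * beta1                        # intercept from (2)

# Sanity check: degree-1 polyfit solves the same least-squares problem
coef = np.polyfit(x, y, 1)  # returns [slope, intercept]
print(beta1, beta0)
```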